Abstract

Given a collection of discrete random variables representing outcomes
of learned local predictors in natural language, e.g., named entities
and relations, we seek an optimal global assignment to the variables in
the presence of general (non-sequential) constraints. Examples of these
constraints include the type of arguments a relation can take and the
mutual activity of different relations. We develop a linear programming
formulation for this problem and evaluate it in the context of
simultaneously learning named entities and relations. Our approach
allows us to efficiently incorporate domain and task specific
constraints at decision time, resulting in significant improvements in
the accuracy and the “human-like” quality of the inferences.

1  Introduction 

Natural language decisions often depend on the outcomes of several
different but mutually dependent predictions. These predictions must
respect some constraints that could arise from the nature of the data
or from domain or task specific conditions. For example, in
part-of-speech tagging, a sentence must have at least one verb, and
cannot have three consecutive verbs. These facts can be used as
constraints. In named entity recognition, “no entities can overlap” is
a common constraint used in various works (Tjong Kim Sang and De
Meulder, 2003).

Efficient solutions to problems of this sort have been given when the
constraints on the predictors are sequential (Dietterich, 2002). These
solutions can be categorized into the following two frameworks.
Learning global models trains a probabilistic model under the
constraints imposed by the domain. Examples include variations of HMMs,
conditional models and sequential variations of Markov random fields
(Lafferty et al., 2001). The other framework, inference with
classifiers (Roth, 2002), views maintaining constraints and learning
classifiers as separate processes. Various local classifiers are
trained without knowledge of the constraints. Their predictions are
taken as input by an inference procedure, which then finds the best
global prediction. In addition to the conceptual simplicity of this
approach, it also seems to perform better experimentally (Tjong Kim
Sang and De Meulder, 2003).

Typically, efficient inference procedures in both frameworks rely on
dynamic programming (e.g., Viterbi), which works well on sequential
data. However, in many important problems, the structure is more
general, resulting in computationally intractable inference. Problems
of this sort have been studied in computer vision, where inference is
generally performed over low level measurements rather than over higher
level predictors (Levin et al., 2002; Boykov et al., 2001).

This work develops a novel inference with classifiers approach. Rather
than being restricted to sequential data, we study a fairly general
setting. The problem is defined in terms of a collection of discrete
random variables representing binary relations and their arguments; we
seek an optimal assignment to the variables in the presence of the
constraints on the binary relations between variables and the relation
types.

The key insight to this solution comes from recent techniques developed
for approximation algorithms (Chekuri et al., 2001). Following this
work, we model inference as an optimization problem, and show how to
cast it as a linear program. Using existing numerical packages, which
are able to solve very large linear programming problems in a very
short time,¹ inference can be done very quickly.

¹For example, (CPLEX, 2003) is able to solve a linear programming
problem of 13 million variables within 5 minutes.

[Report Documentation Page, Standard Form 298 (Rev. 8-98). Title: A
Linear Programming Formulation for Global Inference in Natural Language
Tasks. Performing organization: University of Illinois, Department of
Computer Science, Urbana, IL 61801. Dates covered: 2004. Distribution:
approved for public release; distribution unlimited. Security
classification: unclassified. Number of pages: 8.]

Our approach could be contrasted with other approaches to sequential
inference or to general Markov random field approaches (Lafferty et
al., 2001; Taskar et al., 2002). The key difference is that in these
approaches, the model is learned globally, under the constraints
imposed by the domain. In our approach, predictors do not need to be
learned in the context of the decision tasks, but rather can be learned
in other contexts, or incorporated as background knowledge. This way,
our approach allows the incorporation of constraints into decisions in
a dynamic fashion and can therefore support task specific inferences.
The significance of this is clearly shown in our experimental results.

We develop our models in the context of natural language inferences and
evaluate them here on the problem of simultaneously recognizing named
entities and relations between them.

1.1  Entity  and  Relation  Recognition 

This is the problem of recognizing the kill (KFJ, Oswald) relation in
the sentence “J. V. Oswald was murdered at JFK after his assassin,
R. U. KFJ...”. This task requires making several local decisions, such
as identifying named entities in the sentence, in order to support the
relation identification. For example, it may be useful to identify that
Oswald and KFJ are people, and JFK is a location. This, in turn, may
help to identify that the kill action is described in the sentence. At
the same time, the relation kill constrains its arguments to be people
(or at least, not to be locations) and helps to enforce that Oswald and
KFJ are likely to be people, while JFK is not.

In our model, we first learn a collection of “local” predictors, e.g.,
entity and relation identifiers. At decision time, given a sentence, we
produce a global decision that optimizes over the suggestions of the
classifiers that are active in the sentence, known constraints among
them and, potentially, domain or task specific constraints relevant to
the current decision.

Although a brute-force algorithm may seem feasible for short sentences,
as the number of entity variables grows, the computation becomes
intractable very quickly. Given $n$ entities in a sentence, there are
$O(n^2)$ possible relations between them. Assume that each variable
(entity or relation) can take $l$ labels (“none” is one of these
labels). Thus, there are $l^{n^2}$ possible assignments, which is too
large even for a small $n$.
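To make the growth concrete, the count can be tabulated directly. The following is a minimal sketch, assuming (as the paragraph above does) that every variable takes the same number of labels and that there is one relation variable per ordered pair of distinct entities, so that $n$ entity variables plus $n(n-1)$ relation variables give $n^2$ variables in total. The function names are illustrative only.

```python
def count_variables(n):
    """Return (entity variables, ordered relation variables) for n entities."""
    return n, n * (n - 1)


def count_assignments(n, num_labels):
    """Number of joint labelings when every variable can take num_labels labels."""
    entities, relations = count_variables(n)
    return num_labels ** (entities + relations)


# The search space explodes long before sentences become long.
for n in (2, 4, 6, 8):
    print(n, count_assignments(n, 5))
```

Even with only five labels per variable, a sentence with eight entities already has 5^64 candidate labelings, which is why exhaustive search is not a practical inference procedure.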

When evaluated on simultaneous learning of named entities and
relations, our approach not only provides a significant improvement in
the predictors’ accuracy; more importantly, it provides coherent
solutions. While many statistical methods make “stupid” mistakes (i.e.,
inconsistency among predictions) that no human ever makes, our
approach, as we show, also improves the quality of the inference
significantly.


The rest of the paper is organized as follows. Section 2 formally
defines our problem and section 3 describes the computational approach
we propose. Experimental results are given in section 4, followed by
some discussion and conclusion in section 5.

2  The  Relational  Inference  Problem 

We consider the relational inference problem within the reasoning with
classifiers paradigm, and study a specific but fairly general
instantiation of this problem, motivated by the problem of recognizing
named entities (e.g., persons, locations, organization names) and
relations between them (e.g., work_for, located_in, live_in). We
consider a set $\mathcal{V}$ which consists of two types of variables,
$\mathcal{V} = \mathcal{E} \cup \mathcal{R}$. The first set of
variables $\mathcal{E} = \{E_1, E_2, \ldots, E_n\}$ ranges over
$\mathcal{L}_\mathcal{E}$. The value (called “label”) assigned to
$E_i \in \mathcal{E}$ is denoted $f_{E_i} \in \mathcal{L}_\mathcal{E}$.
The second set of variables
$\mathcal{R} = \{R_{ij}\}_{1 \le i,j \le n;\, i \ne j}$ is viewed as
binary relations over $\mathcal{E}$. Specifically, for each pair of
entities $E_i$ and $E_j$, $i \ne j$, we use $R_{ij}$ and $R_{ji}$ to
denote the (binary) relations $(E_i, E_j)$ and $(E_j, E_i)$
respectively. The set of labels of relations is
$\mathcal{L}_\mathcal{R}$ and the label assigned to relation
$R_{ij} \in \mathcal{R}$ is $f_{R_{ij}} \in \mathcal{L}_\mathcal{R}$.

Clearly, there exist some constraints on the labels of corresponding
relation and entity variables. For instance, if the relation is
live_in, then the first entity should be a person, and the second
entity should be a location. The correspondence between the relation
and entity variables can be represented by a bipartite graph. Each
relation variable $R_{ij}$ is connected to its first entity $E_i$ and
second entity $E_j$. We use $\mathcal{N}^1$ and $\mathcal{N}^2$ to
denote the entity variables of a relation $R_{ij}$. Specifically,
$E_i = \mathcal{N}^1(R_{ij})$ and $E_j = \mathcal{N}^2(R_{ij})$.
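The bipartite structure just described can be sketched with plain tuples. This is an illustrative representation only, assuming each relation variable is stored as the ordered pair of its entity indices; the helper names are hypothetical and not taken from the paper.

```python
from itertools import permutations


def relation_variables(n):
    """All ordered pairs (i, j) with i != j, one per relation variable R_ij."""
    return list(permutations(range(1, n + 1), 2))


def first_argument(relation):
    """Index of the first entity argument of a relation variable."""
    return relation[0]


def second_argument(relation):
    """Index of the second entity argument of a relation variable."""
    return relation[1]


def relations_touching(entity, relations):
    """Relation variables adjacent to an entity in the bipartite graph."""
    return [r for r in relations if entity in r]


relations = relation_variables(3)
print(len(relations))                    # 6
print(relations_touching(1, relations))  # [(1, 2), (1, 3), (2, 1), (3, 1)]
```

Each entity is adjacent to 2(n - 1) relation variables, once as a first argument and once as a second argument for every other entity.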

In addition, we define a set of constraints on the outcomes of the
variables in $\mathcal{V}$.
$\mathcal{C}^1 : \mathcal{L}_\mathcal{R} \times \mathcal{L}_\mathcal{E} \to \{0, 1\}$
constrains the values of the first argument of a relation.
$\mathcal{C}^2$ is defined similarly and constrains the second argument
a relation can take. For example, (born_in, person) is in
$\mathcal{C}^1$ but not in $\mathcal{C}^2$ because the first entity of
relation born_in has to be a person and the second entity can only be a
location instead of a person. Note that while we define the constraints
here as Boolean, our formalism in fact allows for stochastic
constraints. Also note that we can define a large number of
constraints, such as
$\mathcal{C}^R : \mathcal{L}_\mathcal{R} \times \mathcal{L}_\mathcal{R} \to \{0, 1\}$,
which constrains types of relations, etc. In fact, as will be clear in
Sec. 3, the language for defining constraints is very rich - linear
(in)equalities over $\mathcal{V}$.

We exemplify the framework using the problem of simultaneous
recognition of named entities and relations in sentences. Briefly
speaking, we assume a learning mechanism that can recognize entity
phrases in sentences, based on local contextual features. Similarly, we
assume a learning mechanism that can recognize the semantic relation
between two given phrases in a sentence.

We seek an inference algorithm that can produce a coherent labeling of
entities and relations in a given sentence. The labeling should follow,
as closely as possible, the recommendations of the entity and relation
classifiers, while also satisfying natural constraints on whether
specific entities can be the argument of specific relations, whether
two relations can occur together at the same time, or any other
information that might be available at inference time (e.g., suppose it
is known that entities A and B represent the same location; one may
like to incorporate an additional constraint that prevents an inference
of the type: “C lives in A; C does not live in B”).

We note that a large number of problems can be modeled this way.
Examples include problems such as chunking sentences (Punyakanok and
Roth, 2001), coreference resolution and sequencing problems in
computational biology. In fact, each of the components of our problem
here, the separate task of recognizing named entities in sentences and
the task of recognizing semantic relations between phrases, can be
modeled this way. However, our goal is specifically to consider
interacting problems at different levels, resulting in more complex
constraints among them, and exhibit the power of our method.

The most direct way to formalize our inference problem is via the
formalism of Markov Random Field (MRF) theory (Li, 2001). Rather than
doing that, for computational reasons, we first use a fairly standard
transformation of MRF to a discrete optimization problem (see
(Kleinberg and Tardos, 1999) for details). Specifically, under weak
assumptions we can view the inference problem as the following
optimization problem, which aims to minimize an objective function that
is the sum of the following two cost functions.

Assignment cost: the cost of deviating from the assignment of the
variables $\mathcal{V}$ given by the classifiers. The specific cost
function we use is defined as follows: Let $l$ be the label assigned to
variable $u \in \mathcal{V}$. If the marginal probability estimation is
$p = P(f_u = l)$, then the assignment cost $c_u(l)$ is $-\log p$.

Constraint cost: the cost imposed by breaking constraints between
neighboring nodes. The specific cost function we use is defined as
follows: Consider two entity nodes $E_i$, $E_j$ and their corresponding
relation node $R_{ij}$; that is, $E_i = \mathcal{N}^1(R_{ij})$ and
$E_j = \mathcal{N}^2(R_{ij})$. The constraint cost indicates whether
the labels are consistent with the constraints. In particular, we use:
$d^1(f_{R_{ij}}, f_{E_i})$ is 0 if
$(f_{R_{ij}}, f_{E_i}) \in \mathcal{C}^1$; otherwise,
$d^1(f_{R_{ij}}, f_{E_i})$ is $\infty$.² Similarly, we use $d^2$ to
force the consistency of the second argument of a relation.

^In  practice,  we  use  a  very  large  number  (e.g.,  9^®). 


Since we are seeking the most probable global assignment that satisfies
the constraints, the overall cost function we optimize, for a global
labeling $f$ of all variables, is:

$$c(f) = \sum_{u \in \mathcal{V}} c_u(f_u) + \sum_{R_{ij} \in \mathcal{R}} \left[ d^1(f_{R_{ij}}, f_{E_i}) + d^2(f_{R_{ij}}, f_{E_j}) \right] \tag{1}$$
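The effect of this cost function can be seen on a toy instance small enough to enumerate. The sketch below is not the method proposed in this paper (which avoids enumeration); it simply evaluates the sum of assignment and constraint costs for every labeling of two entities and one relation. The label sets, the allowed (relation label, entity label) pairs and the classifier probabilities are all made-up values for illustration.

```python
import math
from itertools import product

INF = float("inf")

ENTITY_LABELS = ["per", "loc"]
RELATION_LABELS = ["live_in", "kill", "none"]

# Hypothetical allowed (relation label, entity label) pairs for the
# first argument (C1) and the second argument (C2) of a relation.
C1 = {("live_in", "per"), ("kill", "per"), ("none", "per"), ("none", "loc")}
C2 = {("live_in", "loc"), ("kill", "per"), ("none", "per"), ("none", "loc")}

# Hypothetical marginal probabilities produced by local classifiers.
P = {
    "E1": {"per": 0.9, "loc": 0.1},
    "E2": {"per": 0.6, "loc": 0.4},
    "R12": {"live_in": 0.7, "kill": 0.2, "none": 0.1},
}


def constraint_cost(allowed, rel_label, ent_label):
    """0 if the pair is allowed, infinity otherwise."""
    return 0.0 if (rel_label, ent_label) in allowed else INF


def total_cost(e1, e2, r12):
    """Assignment cost (-log p per variable) plus both constraint costs."""
    assignment = -sum(
        math.log(P[var][lab]) for var, lab in (("E1", e1), ("E2", e2), ("R12", r12))
    )
    return assignment + constraint_cost(C1, r12, e1) + constraint_cost(C2, r12, e2)


def best_assignment():
    """Exhaustively search all labelings for the minimum-cost one."""
    candidates = product(ENTITY_LABELS, ENTITY_LABELS, RELATION_LABELS)
    return min(candidates, key=lambda f: total_cost(*f))


local = tuple(max(P[v], key=P[v].get) for v in ("E1", "E2", "R12"))
print(local)              # ('per', 'per', 'live_in')
print(best_assignment())  # ('per', 'loc', 'live_in')
```

In this toy case the independent local decisions are mutually inconsistent (live_in with a person as second argument), and minimizing the global cost changes the second entity's label rather than the relation's, because that is the cheapest way to restore consistency.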


3  A  Computational  Approach  to 
Relational  Inference 

Unfortunately, it is not hard to see that the combinatorial problem
(Eq. 1) is computationally intractable even when placing assumptions on
the cost function (Kleinberg and Tardos, 1999). The computational
approach we adopt is to develop a linear programming (LP) formulation
of the problem, and then solve the corresponding integer linear
programming (ILP) problem. Our LP formulation is based on the method
proposed by (Chekuri et al., 2001). Since the objective function
(Eq. 1) is not a linear function in terms of the labels, we introduce
new binary variables to represent different possible assignments to
each original variable; we then represent the objective function as a
linear function of these binary variables.

Let $x_{\{u,i\}}$ be a {0, 1}-variable, defined to be 1 if and only if
variable $u$ is labeled $i$, where
$u \in \mathcal{E}, i \in \mathcal{L}_\mathcal{E}$ or
$u \in \mathcal{R}, i \in \mathcal{L}_\mathcal{R}$. For example,
$x_{\{E_1,2\}} = 1$ when the label of entity $E_1$ is 2;
$x_{\{R_{23},3\}} = 0$ when the label of relation $R_{23}$ is not 3.
Let $x_{\{R_{ij},r,E_i,e_1\}}$ be a {0, 1}-variable indicating whether
relation $R_{ij}$ is assigned label $r$ and its first argument, $E_i$,
is assigned label $e_1$. For instance, $x_{\{R_{12},1,E_1,2\}} = 1$
means the label of relation $R_{12}$ is 1 and the label of its first
argument, $E_1$, is 2. Similarly, $x_{\{R_{ij},r,E_j,e_2\}} = 1$
indicates that $R_{ij}$ is assigned label $r$ and its second argument,
$E_j$, is assigned label $e_2$. With these definitions, the
optimization problem can be represented as the following ILP problem
(Figure 1).

Equations (2) and (3) require that each entity or relation variable can
only be assigned one label. Equations (4) and (5) ensure that the
assignment to each entity or relation variable is consistent with the
assignment to its neighboring variables. (6), (7), and (8) are the
integral constraints on these binary variables.
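To illustrate what constraints (2) through (5) demand, the following sketch encodes a concrete labeling as 0/1 indicator variables and checks the four families of equalities. It is only a consistency check written for this illustration, not a solver, and the label sets and the dictionary-based encoding are assumptions made here.

```python
ENTITY_LABELS = ["per", "loc"]
RELATION_LABELS = ["live_in", "none"]


def encode(entity_labels, relation_labels):
    """Turn a labeling into indicator variables.

    entity_labels maps an entity index to its label; relation_labels maps
    an ordered pair (i, j) to the label of the relation variable R_ij.
    """
    x_ent = {(i, e): int(lab == e)
             for i, lab in entity_labels.items() for e in ENTITY_LABELS}
    x_rel = {(ij, r): int(lab == r)
             for ij, lab in relation_labels.items() for r in RELATION_LABELS}
    x_joint = {}
    for ij, rel_lab in relation_labels.items():
        for arg in ij:  # first and second argument of R_ij
            for r in RELATION_LABELS:
                for e in ENTITY_LABELS:
                    both = rel_lab == r and entity_labels[arg] == e
                    x_joint[(ij, r, arg, e)] = int(both)
    return x_ent, x_rel, x_joint


def satisfies_constraints(x_ent, x_rel, x_joint):
    """Check the equality constraints (2)-(5) on indicator variables."""
    entities = {i for i, _ in x_ent}
    relations = {ij for ij, _ in x_rel}
    one_entity_label = all(                                       # (2)
        sum(x_ent[i, e] for e in ENTITY_LABELS) == 1 for i in entities)
    one_relation_label = all(                                     # (3)
        sum(x_rel[ij, r] for r in RELATION_LABELS) == 1 for ij in relations)
    entity_consistent = all(                                      # (4)
        x_ent[arg, e] == sum(x_joint[ij, r, arg, e] for r in RELATION_LABELS)
        for ij in relations for arg in ij for e in ENTITY_LABELS)
    relation_consistent = all(                                    # (5)
        x_rel[ij, r] == sum(x_joint[ij, r, arg, e] for e in ENTITY_LABELS)
        for ij in relations for arg in ij for r in RELATION_LABELS)
    return (one_entity_label and one_relation_label
            and entity_consistent and relation_consistent)


x = encode({1: "per", 2: "loc"}, {(1, 2): "live_in", (2, 1): "none"})
print(satisfies_constraints(*x))  # True
```

Any genuine labeling passes the check by construction; the constraints matter because they rule out indicator vectors that do not correspond to a labeling, such as an entity marked with two labels at once.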

There are several advantages of representing the problem in an LP
formulation. First of all, linear (in)equalities are fairly general and
are able to represent many types of constraints (e.g., the decision
time constraint in the experiment in Sec. 4). More importantly, an ILP
problem at this scale can be solved very quickly using current
commercial LP/ILP packages, like (Xpress-MP, 2003) or (CPLEX, 2003). We
introduce the general strategies of solving an ILP problem here.


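\begin{align}
\min \quad & \sum_{E \in \mathcal{E}} \sum_{e \in \mathcal{L}_\mathcal{E}} c_E(e) \cdot x_{\{E,e\}} + \sum_{R \in \mathcal{R}} \sum_{r \in \mathcal{L}_\mathcal{R}} c_R(r) \cdot x_{\{R,r\}} \notag \\
& + \sum_{\substack{E_i, E_j \in \mathcal{E} \\ E_i \ne E_j}} \Big[ \sum_{r \in \mathcal{L}_\mathcal{R}} \sum_{e_1 \in \mathcal{L}_\mathcal{E}} d^1(r, e_1) \cdot x_{\{R_{ij},r,E_i,e_1\}} + \sum_{r \in \mathcal{L}_\mathcal{R}} \sum_{e_2 \in \mathcal{L}_\mathcal{E}} d^2(r, e_2) \cdot x_{\{R_{ij},r,E_j,e_2\}} \Big] \notag
\end{align}

subject to:

\begin{align}
& \sum_{e \in \mathcal{L}_\mathcal{E}} x_{\{E,e\}} = 1 && \forall E \in \mathcal{E} \tag{2} \\
& \sum_{r \in \mathcal{L}_\mathcal{R}} x_{\{R,r\}} = 1 && \forall R \in \mathcal{R} \tag{3} \\
& x_{\{E,e\}} = \sum_{r \in \mathcal{L}_\mathcal{R}} x_{\{R,r,E,e\}} && \forall E \in \mathcal{E},\ \forall R \in \{R : E = \mathcal{N}^1(R) \text{ or } E = \mathcal{N}^2(R)\} \tag{4} \\
& x_{\{R,r\}} = \sum_{e \in \mathcal{L}_\mathcal{E}} x_{\{R,r,E,e\}} && \forall R \in \mathcal{R},\ \forall E = \mathcal{N}^1(R) \text{ or } E = \mathcal{N}^2(R) \tag{5} \\
& x_{\{E,e\}} \in \{0, 1\} && \forall E \in \mathcal{E},\ e \in \mathcal{L}_\mathcal{E} \tag{6} \\
& x_{\{R,r\}} \in \{0, 1\} && \forall R \in \mathcal{R},\ r \in \mathcal{L}_\mathcal{R} \tag{7} \\
& x_{\{R,r,E,e\}} \in \{0, 1\} && \forall R \in \mathcal{R},\ r \in \mathcal{L}_\mathcal{R},\ E \in \mathcal{E},\ e \in \mathcal{L}_\mathcal{E} \tag{8}
\end{align}

Figure 1: Integer Linear Programming Formulation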
3.1  Linear  Programming  Relaxation  (LPR)

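To solve an ILP problem, a natural idea is to relax the integral
constraints, that is, to replace (6), (7), and (8) with:

\begin{align}
& x_{\{E,e\}} \ge 0 && \forall E \in \mathcal{E},\ e \in \mathcal{L}_\mathcal{E} \tag{9} \\
& x_{\{R,r\}} \ge 0 && \forall R \in \mathcal{R},\ r \in \mathcal{L}_\mathcal{R} \tag{10} \\
& x_{\{R,r,E,e\}} \ge 0 && \forall R \in \mathcal{R},\ r \in \mathcal{L}_\mathcal{R},\ E \in \mathcal{E},\ e \in \mathcal{L}_\mathcal{E} \tag{11}
\end{align}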

If LPR returns an integer solution, then it is also the optimal
solution to the ILP problem. If the solution is non-integer, then at
least it gives a lower bound to the value of the cost function, which
can be used in modifying the problem and getting closer to deriving an
optimal integer solution. A direct way to handle the non-integer
solution is called rounding, which finds an integer point that is close
to the non-integer solution. Under some conditions on the cost
functions, which do not hold here, a well designed rounding algorithm
can be shown to produce a rounded solution that is a good approximation
to the optimal solution (Kleinberg and Tardos, 1999; Chekuri et al.,
2001). Nevertheless, in general, the outcome of the rounding procedure
may not even be a legal solution to the problem.


3.2  Branch  &  Bound  and  Cutting  Plane 

Branch and bound is the method that divides an ILP problem into several
LP subproblems. It uses LPR as a subroutine to generate dual (upper and
lower) bounds to reduce the search space, and finds the optimal
solution as well. When LPR finds a non-integer solution, it splits the
problem on the non-integer variable. For example, suppose variable
$x_i$ is fractional in a non-integer solution to the ILP problem
$\min\{cx : x \in S, x \in \{0, 1\}^n\}$, where $S$ is the set of
linear constraints. The ILP problem can be split into two sub LPR
problems, $\min\{cx : x \in S \cap \{x_i = 0\}\}$ and
$\min\{cx : x \in S \cap \{x_i = 1\}\}$. Since any feasible solution
provides an upper bound and any LPR solution generates a lower bound,
the search tree can be effectively cut.

Another strategy for dealing with non-integer points, which is often
combined with branch & bound, is called cutting plane. When a
non-integer solution is given by LPR, it adds a new linear constraint
that makes the non-integer point infeasible, while still keeping the
optimal integer solution in the feasible region. As a result, the
feasible region is closer to the ideal polyhedron, which is the convex
hull of feasible integer solutions. The most famous cutting plane
algorithm is Gomory’s fractional cutting plane method (Wolsey, 1998),
for which it can be shown that only a finite number of additional
constraints are needed. Moreover, researchers develop different cutting
plane algorithms for different types of ILP problems. One example is
(Wang and Regan, 2000), which only focuses on binary ILP problems.

Although in theory a search based strategy may need several steps to
find the optimal solution, LPR always generates integer solutions in
our experiments. This phenomenon may be linked to the theory of
unimodularity.

3.3  Unimodularity 

When the coefficient matrix of a given linear program in its standard
form is unimodular, it can be shown that the optimal solution to the
linear program is in fact integral (Schrijver, 1986). In other words,
LPR is guaranteed to produce an integer solution.

Definition 3.1 A matrix $A$ of rank $m$ is called unimodular if all the
entries of $A$ are integers, and the determinant of every square
submatrix of $A$ of order $m$ is in $\{0, +1, -1\}$.

Theorem 3.1 (Veinott & Dantzig) Let $A$ be an $(m, n)$-integral matrix
with full row rank $m$. Then the polyhedron
$\{x \mid x \ge 0;\ Ax = b\}$ is integral for each integral vector $b$,
if and only if $A$ is unimodular.

Theorem 3.1 indicates that if a linear programming problem is in its
standard form, then regardless of the cost function and the integral
vector $b$, the optimal solution is integral if and only if the
coefficient matrix $A$ is unimodular.
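Definition 3.1 can be checked mechanically for small matrices. The sketch below is a brute-force test written for illustration: it enumerates every choice of $m$ columns of an integer matrix with $m$ rows and tests the determinant of each resulting square submatrix. It assumes, without verifying, that the matrix has full row rank, and its cost grows combinatorially, so it is not suitable for coefficient matrices of realistic size.

```python
from itertools import combinations


def determinant(matrix):
    """Exact integer determinant by Laplace expansion along the first row."""
    if len(matrix) == 1:
        return matrix[0][0]
    total = 0
    for col, entry in enumerate(matrix[0]):
        if entry == 0:
            continue
        minor = [row[:col] + row[col + 1:] for row in matrix[1:]]
        total += (-1) ** col * entry * determinant(minor)
    return total


def is_unimodular(a):
    """Test Definition 3.1 for a matrix with m rows (full row rank assumed)."""
    m = len(a)
    if any(not isinstance(v, int) for row in a for v in row):
        return False
    for cols in combinations(range(len(a[0])), m):
        sub = [[row[c] for c in cols] for row in a]
        if determinant(sub) not in (-1, 0, 1):
            return False
    return True


print(is_unimodular([[1, 1, 0], [0, 1, 1]]))  # True
print(is_unimodular([[1, 1], [-1, 1]]))       # False, the determinant is 2
```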

Although the coefficient matrix in our problem is not unimodular, LPR
still produces integer solutions for all the cases (thousands) we have
experimented with. This may be due to the fact that the coefficient
matrix shares many properties of a unimodular matrix. As a result, most
of the vertices of the polyhedron are integer points. Another possible
reason is that given the cost function we have, the optimal solution is
always integer. Because of the availability of very efficient LP/ILP
packages, we defer the exploration of this direction for now.

4  Experiments 

We describe below two experiments on the problem of simultaneously
recognizing entities and relations. In the first, we view the task as a
knowledge acquisition task - we let the system read sentences and
identify entities and relations among them. Given that this is a
difficult task which may quite often require information beyond the
sentence, we also consider a “forced decision” task, in which we
simulate a question answering situation - we ask the system, say, “who
killed whom” and evaluate it on correctly identifying the relation and
its arguments, given that it is known that somewhere in this sentence
this relation is active. In addition, this evaluation exhibits the
ability of our approach to incorporate task specific constraints at
decision time.

Our experiments are based on the TREC data set (which consists of
articles from WSJ, AP, etc.) that we annotated for named entities and
relations. In order to effectively observe the interaction between
relations and entities, we picked 1437 sentences that have at least one
active relation. Among those sentences, there are 5336 entities, and
19048 pairs of entities (binary relations). Entity labels include 1685
persons, 1968 locations, 978 organizations and 705 others. Relation
labels include 406 located_in, 394 work_for, 451 orgBased_in, 521
live_in, 268 kill, and 17007 none. Note that most pairs of entities
have no active relations at all. Therefore, relation none significantly
outnumbers others. Examples of each relation label and the constraints
between a relation variable and its two entity arguments are shown as
follows.


Relation      Entity1   Entity2   Example
located_in    loc       loc       (New York, US)
work_for      per       org       (Bill Gates, Microsoft)
orgBased_in   org       loc       (HP, Palo Alto)
live_in       per       loc       (Bush, US)
kill          per       per       (Oswald, JFK)

In  order  to  focus  on  the  evaluation  of  our  inference 
procedure,  we  assume  the  problem  of  segmentation  (or 
phrase  detection)  (Abney,  1991;  Punyakanok  and  Roth, 
2001)  is  solved,  and  the  entity  boundaries  are  given  to  us 
as  input;  thus  we  only  concentrate  on  their  classifications. 

We evaluate our LP based global inference procedure against two simpler
approaches and a third that is given more information at learning time.
The basic approach only tests our entity and relation classifiers,
which are trained independently using only local features. In
particular, the relation classifier does not know the labels of its
entity arguments, and the entity classifier does not know the labels of
relations in the sentence either. Since basic classifiers are used in
all approaches, we describe how they are trained here.

For the entity classifier, one set of features is extracted from words
within a size 4 window around the target phrase. They are: (1) words,
part-of-speech tags, and conjunctions of them; (2) bigrams and trigrams
of the mixture of words and tags. In addition, some other features are
extracted from the target phrase, including:


symbol   explanation
icap     the first character of a word is capitalized
acap     all characters of a word are capitalized
incap    some characters of a word are capitalized
suffix   the suffix of a word is “ing”, “ment”, etc.
bigram   bigram of words in the target phrase
len      number of words in the target phrase
place³   the phrase is/has a known place’s name
prof³    the phrase is/has a professional title (e.g. Lt.)
name³    the phrase is/has a known person’s name

For the relation classifier, there are three sets of features: (1)
features similar to those used in the entity classification are
extracted from the two argument entities of the relation; (2)
conjunctions of the features from the two arguments; (3) some patterns
extracted from the sentence or between the two arguments. Some features
in category (3) are “the number of words between arg1 and arg2”,
“whether arg1 and arg2 are the same word”, or “arg1 is the beginning of
the sentence and has words that consist of all capitalized characters”,
where arg1 and arg2 represent the first and second argument entities
respectively. In addition, Table 1 presents some patterns we use.

Pattern                        Example
arg1 , arg2                    San Jose, CA
arg1 , ... a ... arg2 prof     John Smith, a Starbucks manager ...
in/at arg1 in/at/, arg2        Officials in Perugia in Umbria province said ...
arg2 prof arg1                 CNN reporter David McKinley ...
arg1 ... native of ... arg2    Elizabeth Dole is a native of Salisbury, N.C.
arg1 ... based in/at arg2      Leslie Kota, a spokeswoman for K mart based in Troy, Mich. said ...

Table 1: Some patterns used in relation classification

³We collect names of famous places, people and popular titles from
other data sources in advance.

The learning algorithm used is a variation of the Winnow update rule
incorporated in SNoW (Roth, 1998; Roth and Yih, 2002), a multi-class
classifier that is specifically tailored for large scale learning
tasks. SNoW learns a sparse network of linear functions, in which the
targets (entity classes or relation classes, in this case) are
represented as linear functions over a common feature space. While SNoW
can be used as a classifier and predicts using a winner-take-all
mechanism over the activation value of the target classes, we can also
rely directly on the raw activation value it outputs, which is the
weighted linear sum of the active features, to estimate the posteriors.
It can be verified that the resulting values are monotonic with the
confidence in the prediction, and therefore provide a good source of
probability estimation. We use softmax (Bishop, 1995) over the raw
activation values as conditional probabilities. Specifically, suppose
the number of classes is $n$, and the raw activation value of class $i$
is $act_i$. The posterior estimation for class $i$ is derived by the
following equation.

$$p_i = \frac{e^{act_i}}{\sum_{1 \le j \le n} e^{act_j}}$$
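As a concrete illustration of this step, the sketch below converts a list of raw activation values into softmax probabilities and then into the negative log costs used as assignment costs in Section 2. Subtracting the maximum activation before exponentiating is a standard numerical safeguard added here; it leaves the resulting probabilities unchanged and is not something the paper specifies. The activation values in the example are made up.

```python
import math


def softmax(activations):
    """Map raw activation values to probabilities that sum to 1."""
    shift = max(activations)  # numerical safeguard; does not change the result
    exps = [math.exp(a - shift) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]


def assignment_costs(activations):
    """Negative log of the softmax probabilities, one cost per class."""
    return [-math.log(p) for p in softmax(activations)]


probs = softmax([2.0, 1.0, 0.1])
print(probs)
print(assignment_costs([2.0, 1.0, 0.1]))
```

Because softmax is monotonic in the activations, the class with the highest activation receives the highest probability and therefore the lowest assignment cost.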


The pipeline approach mimics the typical strategy in solving complex
natural language problems - separating a task into several stages and
solving them sequentially. For example, a named entity recognizer may
be trained using a different corpus in advance, and given to a relation
classifier as a tool to extract features. This approach first trains an
entity classifier as described in the basic approach, and then uses the
prediction of entities in addition to other local features to learn the
relation identifier. Note that although the true labels of entities are
known here when training the relation identifier, this may not be the
case in general NLP problems. Since only the predicted entity labels
are available in testing, learning on the predictions of the entity
classifier presumably makes the relation classifier more tolerant to
the mistakes of the entity classifier. In fact, we also observe this
phenomenon empirically. When the relation classifier is trained using
the true entity labels, the performance is much worse than using the
predicted entity labels.

LP is our global inference procedure. It takes as input
the constraints between a relation and its entity arguments,
and the output (the estimated probability distribution
of labels) of the basic classifiers. Note that LP may
change the predictions for either entity labels or relation
labels, while pipeline fully trusts the labels of the entity
classifier, and only its relation predictions may differ
from those of the basic relation classifier. In other words, LP is
able to enhance the performance of entity classification,
which is impossible for pipeline.
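
To make the contrast with pipeline concrete, the following sketch shows how a constraint between a relation and its arguments can overturn a locally preferred entity label. It is only an illustration under stated assumptions: exhaustive enumeration stands in for a linear programming solver, the score is taken to be the sum of log posteriors, and the label sets, the `ALLOWED_ARGS` table, and all probabilities are hypothetical.

```python
import itertools
import math

ENTITY_LABELS = ["person", "location", "other"]
RELATION_LABELS = ["kill", "located_in", "none"]

# Hypothetical constraints: argument types each active relation may take.
ALLOWED_ARGS = {
    "kill": {("person", "person")},
    "located_in": {("location", "location")},
}

def consistent(rel, e1, e2):
    """A relation with no entry (here "none") accepts any argument types."""
    allowed = ALLOWED_ARGS.get(rel)
    return allowed is None or (e1, e2) in allowed

def global_inference(p_e1, p_e2, p_rel):
    """Return the consistent (e1, e2, rel) assignment with the highest score.

    Assumes all probabilities are strictly positive, as softmax outputs are.
    """
    best, best_score = None, -math.inf
    for e1, e2, rel in itertools.product(ENTITY_LABELS, ENTITY_LABELS, RELATION_LABELS):
        if not consistent(rel, e1, e2):
            continue
        score = math.log(p_e1[e1]) + math.log(p_e2[e2]) + math.log(p_rel[rel])
        if score > best_score:
            best, best_score = (e1, e2, rel), score
    return best

# Illustrative posteriors: the first entity is locally labeled "location",
# which contradicts the confidently predicted "kill" relation.
p_e1 = {"person": 0.45, "location": 0.50, "other": 0.05}
p_e2 = {"person": 0.90, "location": 0.05, "other": 0.05}
p_rel = {"kill": 0.80, "located_in": 0.10, "none": 0.10}

local = (max(p_e1, key=p_e1.get), max(p_e2, key=p_e2.get), max(p_rel, key=p_rel.get))
joint = global_inference(p_e1, p_e2, p_rel)
```

In this toy case the independent decisions give an inconsistent triple, while the joint assignment relabels the first entity as a person. Enumeration is exponential in the number of variables, so it is only suitable for illustrating the effect, not for replacing the LP.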

The final approach, Omniscience, tests the conceptual
upper bound of this entity/relation classification problem.
It also trains the two classifiers separately, as in the basic
approach. However, it assumes that the entity classifier
knows the correct relation labels, and similarly that the
relation classifier knows the correct entity labels. This
additional information is then used as features in training
and testing. Note that this assumption is totally unrealistic.
Nevertheless, it may give us a hint of how much
global inference can achieve.

4.1  Results 

Tables 2 & 3 show the performance of each approach in
$F_{\beta=1}$ using 5-fold cross-validation. The results show that
LP performs consistently better than basic and pipeline,
both in entities and relations. Note that LP does not apply
learning at all, but still outperforms pipeline, which uses
entity predictions as new features in learning. The results
of the omniscient classifiers reveal that there is still room
for improvement. One option is to apply learning to tune
a better cost function in the LP approach.
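
As a reading aid for the tables, the sketch below computes the F measure from precision and recall using the standard definition, which with β = 1 reduces to the harmonic mean 2PR / (P + R). The function name `f_beta` is an assumption; the two input pairs are taken from the Basic row for location and the LP row for kill.

```python
def f_beta(precision, recall, beta=1.0):
    """Standard F measure; beta = 1 weights precision and recall equally."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Basic approach on "location": Prec. 90.9, Rec. 68.2 (Table 2).
loc_basic = f_beta(90.9, 68.2)

# LP approach on "kill": Prec. 82.2, Rec. 81.3 (Table 3).
kill_lp = f_beta(82.2, 81.3)
```

Rounded to one decimal place, these reproduce the reported values of 77.9 and 81.7.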

One of the more significant results in our experiments,
we believe, is the improvement in the quality of the decisions.
As mentioned in Sec. 1, incorporating constraints
helps to avoid inconsistency in classification. It is
interesting to investigate how often such mistakes happen
without global inference, and to see how effectively
global inference reduces them.

            person               organization         location
Approach    Rec.   Prec.  F1     Rec.   Prec.  F1     Rec.   Prec.  F1
Basic       89.4   89.2   89.3   86.9   91.4   89.1   68.2   90.9   77.9
Pipeline    89.4   89.2   89.3   86.9   91.4   89.1   68.2   90.9   77.9
LP          90.4   90.0   90.2   88.5   91.7   90.1   71.5   91.0   80.1
Omniscient  94.9   93.5   94.2   92.3   96.5   94.4   88.3   93.4   90.8

Table 2: Results of Entity Classification

            located_in           work_for             orgBased_in
Approach    Rec.   Prec.  F1     Rec.   Prec.  F1     Rec.   Prec.  F1
Basic       54.7   43.0   48.2   42.1   51.6   46.4   36.1   84.9   50.6
Pipeline    51.2   51.6   51.4   41.4   55.6   47.5   36.9   76.6   49.9
LP          53.2   59.5   56.2   40.4   72.9   52.0   36.3   90.1   51.7
Omniscient  64.0   54.5   58.9   50.5   69.1   58.4   50.2   76.7   60.7

            live_in              kill
Approach    Rec.   Prec.  F1     Rec.   Prec.  F1
Basic       39.7   61.6   48.3   82.1   73.6   77.6
Pipeline    42.6   62.2   50.6   83.2   76.4   79.6
LP          41.5   68.1   51.6   81.3   82.2   81.7
Omniscient  57.0   60.7   58.8   82.1   74.6   78.2

Table 3: Results of Relation Classification

For this purpose, we define the quality of the decision
as follows. For an active relation whose label is
classified correctly, if both its argument entities are also
predicted correctly, we count it as a coherent prediction;
otherwise, we count it as incoherent.
Quality is then the number of coherent predictions divided
by the sum of coherent and incoherent predictions.
Since the basic and pipeline approaches do not have a
global view of the labels of entities and relations, 5%
to 25% of their predictions are incoherent, so their
quality is not always good. On the other hand, our global
inference procedure, LP, takes the natural constraints into
account, so it never generates incoherent predictions. If
the relation classifier has the correct entity labels as features,
a good learner should learn the constraints as well.
As a result, the quality of omniscient is almost as good as
LP.
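
The quality measure defined above can be computed as in the following sketch. The record layout (`gold_rel`, `pred_rel`, `gold_args`, `pred_args`), the use of the label "none" for an inactive relation, and the sample records are all assumptions made for illustration; they are not the data format used in the experiments.

```python
def decision_quality(records):
    """Fraction of correctly labeled active relations whose arguments are also correct."""
    coherent = incoherent = 0
    for r in records:
        # Only active relations whose label was classified correctly are counted.
        if r["gold_rel"] == "none" or r["pred_rel"] != r["gold_rel"]:
            continue
        if r["pred_args"] == r["gold_args"]:
            coherent += 1
        else:
            incoherent += 1
    total = coherent + incoherent
    return coherent / total if total else None

# Hypothetical predictions for five candidate relations.
records = [
    {"gold_rel": "kill", "pred_rel": "kill",
     "gold_args": ("person", "person"), "pred_args": ("person", "person")},
    {"gold_rel": "kill", "pred_rel": "kill",
     "gold_args": ("person", "person"), "pred_args": ("location", "person")},
    {"gold_rel": "live_in", "pred_rel": "live_in",
     "gold_args": ("person", "location"), "pred_args": ("person", "location")},
    {"gold_rel": "work_for", "pred_rel": "none",      # wrong relation label: skipped
     "gold_args": ("person", "organization"), "pred_args": ("person", "organization")},
    {"gold_rel": "none", "pred_rel": "none",          # inactive relation: skipped
     "gold_args": ("person", "location"), "pred_args": ("person", "location")},
]

quality = decision_quality(records)
```

Here two of the three counted relations are coherent, so the quality is 2/3.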

Another experiment we did is the forced decision test,
which boosts the F1 of the “kill” relation to 86.2%. Here
we consider only sentences in which the “kill” relation
is active. We force the system to determine which of the
possible relations in a sentence (i.e., which pair of entities)
has this relation by adding a new linear equality.
This is a realistic situation (e.g., in the context of question
answering) in that it adds an external constraint that was not
present at the time of learning the classifiers, and it
evaluates the ability of our inference algorithm to cope with
it. The results show that our expectations are correct.
In fact, we believe that in natural situations the number
of constraints that can apply is even larger. Observing
how the algorithm performs on other specific forced decision
tasks verifies that LP is reliable in these situations.
As shown in the experiment, it even performs better than
omniscience, which is given more information at learning
time but cannot adapt to the situation at decision time.
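
The effect of the added equality can be illustrated with a simplified sketch. It assumes a binary view of each entity pair (the pair either has the “kill” relation or it does not), scores an assignment by the sum of log posteriors, and enforces that exactly one pair is labeled “kill”. Entity-type constraints are ignored here, and the function name, pair names, and probabilities are hypothetical.

```python
import math

def forced_kill_decision(p_kill_by_pair):
    """Pick exactly one entity pair to carry the "kill" relation.

    Assumes every probability lies strictly between 0 and 1.
    """
    best_pair, best_score = None, -math.inf
    for chosen in p_kill_by_pair:
        # Score of the assignment in which only `chosen` is labeled "kill".
        score = sum(
            math.log(p) if pair == chosen else math.log(1.0 - p)
            for pair, p in p_kill_by_pair.items()
        )
        if score > best_score:
            best_pair, best_score = chosen, score
    return best_pair

# Illustrative posteriors: no pair reaches 0.5, so independent decisions
# would predict no "kill" relation at all in this sentence.
p_kill = {
    ("Oswald", "Kennedy"): 0.40,
    ("Oswald", "Dallas"): 0.10,
    ("Kennedy", "Dallas"): 0.05,
}

unconstrained = [pair for pair, p in p_kill.items() if p > 0.5]
forced = forced_kill_decision(p_kill)
```

Without the equality, no pair is selected; with it, the most plausible pair is chosen even though its local posterior is below one half. This is the kind of decision-time adaptation that retraining-free inference allows.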

5  Discussion 

We presented a linear programming based approach
for global inference where decisions depend on the
outcomes of several different but mutually dependent
classifiers. Even in the presence of a fairly general constraint
structure, deviating from the sequential nature typically
studied, this approach can find the optimal solution
efficiently.

Contrary to general search schemes (e.g., beam
search), which do not guarantee optimality, the linear
programming approach provides an efficient way of finding
the optimal solution. The key advantage of the linear
programming formulation is its generality and flexibility;
in particular, it supports the ability to incorporate
classifiers learned in other contexts, “hints” supplied, and
decision time constraints, and to reason with all these for the
best global prediction. In sharp contrast with the typically
used pipeline framework, our formulation does not
blindly trust the results of some classifiers, and therefore
is able to overcome mistakes made by classifiers with the
help of constraints.

Our experiments have demonstrated these advantages
by considering the interaction between entity and relation
classifiers. In fact, more classifiers can be added and
used within the same framework. For example, if coreference
resolution is available, it is possible to incorporate
it in the form of constraints that force the labels of the
coreferred entities to be the same (but, of course, allowing
the global solution to reject the suggestion of these
classifiers). Consequently, this may enhance the performance
of entity/relation recognition and, at the same time,
correct possible coreference resolution errors. Another
example is to use chunking information for better relation
identification; suppose, for example, that we have available
chunking information that identifies Subj+Verb and
Verb+Object phrases. Given a sentence that has the verb
“murder”, we may conclude that the subject and object of
this verb are in a “kill” relation. Since the chunking
information is used in the global inference procedure, this
information will contribute to enhancing its performance
and robustness, relying on having more constraints and
overcoming possible mistakes by some of the classifiers.
Moreover, in an interactive environment where a user can
supply new constraints (e.g., a question answering situation),
this framework is able to make use of the new
information and enhance the performance at decision time,
without retraining the classifiers.

As we show, our formulation not only improves
accuracy, but also improves the “human-like”
quality of the decisions. We believe that it has the potential
to be a powerful way of supporting natural language
inferences.